This document explains the methodology of the correlations analysis for San Jose social distancing compliance and summarizes key results. It uses data on social distancing through 5/4/2020.

library(tidyverse)
library(plotly)
library(sf)
library(mapview)
library(tigris)
library(censusapi)
library(leaflet)
library(lehdr)
library(usmap)


options(
  tigris_class = "sf",
  tigris_use_cache = TRUE
)

Methodology

The data used for social distancing compliance comes from SafeGraph’s social distancing dataset. In this analysis, we used specifically the data on devices “completely at home,” which SafeGraph defines as devices that did not leave their usual nighttime location (see documentation at https://docs.safegraph.com/docs/social-distancing-metrics). For each census block group in San Jose, we calculated the average percent of devices completely at home on weekdays since the start of the Bay Area shelter-in-place order (3/16/2020), as well as the average percent of devices completely at home on weekdays during January and February 2020, prior to the shelter-in-place order and widespread COVID-19 concerns. From these results, we obtained the percent of devices leaving home during each period as 100 minus the percent completely at home.
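As a rough sketch of the block-group calculation described above, the following uses a small made-up table in place of the SafeGraph data. The column names `origin_census_block_group`, `device_count`, and `completely_home_device_count` are assumed to match the SafeGraph schema; the object names `daily` and `pct_leaving_by_period`, the toy values, and the choice to average daily percentages (rather than pool device counts) are assumptions for illustration only.

```r
# Toy daily records standing in for the SafeGraph social distancing data.
# All values are invented; the real pre period covers January and February 2020.
daily <- data.frame(
  origin_census_block_group = rep(c("060855001001", "060855002002"), each = 4),
  date = as.Date(rep(c("2020-02-03", "2020-02-08", "2020-03-17", "2020-03-18"), times = 2)),
  device_count = c(100, 100, 100, 100, 200, 200, 200, 200),
  completely_home_device_count = c(20, 40, 50, 60, 30, 80, 60, 80)
)

# Keep weekdays only (wday: 0 = Sunday, 1 = Monday, ..., 6 = Saturday)
daily <- daily[as.POSIXlt(daily$date)$wday %in% 1:5, ]

# Label each day relative to the 3/16/2020 shelter-in-place order
daily$period <- ifelse(daily$date >= as.Date("2020-03-16"), "post", "pre")

# Daily percent of devices leaving home = 100 - percent completely at home
daily$pct_leaving <- 100 - 100 * daily$completely_home_device_count / daily$device_count

# Average the daily percentages within each block group and period
pct_leaving_by_period <- aggregate(
  pct_leaving ~ origin_census_block_group + period,
  data = daily,
  FUN = mean
)
pct_leaving_by_period
```

With these toy values, the first block group goes from 80% of devices leaving home before the order to 45% after it.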

In our analysis, we examined the correlations between the percent of devices leaving home before and after the shelter-in-place order was instated and various demographic variables, including income, age, language ability, race, ethnicity, education level, vehicle ownership, occupants per room in a household, sex of workers, and high speed internet access. Information on the demographic variables at the census block group level was obtained from the American Community Survey 2018 data. We also assessed the correlations between these demographic variables and the change in percent of devices staying completely at home after the shelter-in-place order relative to before the order. This latter metric should indicate the ability of a community to alter its behavior to comply with the shelter-in-place order.
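The change metric and a single correlation can be sketched as follows. This is a minimal illustration on invented numbers, not the analysis itself: the data frame `bg` and its column names are hypothetical, and Pearson correlation via `cor.test()` is assumed as the measure of association.

```r
# Hypothetical block-group table; every value is invented for illustration.
bg <- data.frame(
  pct_home_pre  = c(20, 22, 25, 21, 24, 23),  # % completely at home, Jan-Feb weekdays
  pct_home_post = c(35, 40, 52, 38, 50, 44),  # % completely at home, weekdays after 3/16/2020
  pct_over_125k = c(10, 25, 70, 20, 60, 40)   # % of households earning over $125,000
)

# Change metric: increase (in percentage points) in devices staying completely at home
bg$pct_increase_home <- bg$pct_home_post - bg$pct_home_pre

# Pearson correlation between the change metric and the demographic variable
ct <- cor.test(bg$pct_increase_home, bg$pct_over_125k)
ct$estimate  # correlation coefficient
ct$p.value   # p value for the null hypothesis of zero correlation
```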

Results

Here we present a summary of the key significant results from our correlations analysis.

Income alone is a strong predictor

Income was a strong predictor of percent of devices leaving the home during the shelter-in-place order period. We considered different income thresholds, and concluded that percent of households earning over $125,000 annually was the best predictor. In the graph below, the percent of devices leaving the home is plotted against the percent of households making over $125,000, with each block group represented as a point on the graph. The best fit linear trendline is shown in orange. The slider at the bottom switches the data between the percent of devices leaving home before the shelter-in-place order and the percent leaving home after it.

# load data
sj_dem_distancing_pre_post <- readRDS("/Users/simonespeizer/Documents/2020 Spring Quarter/CEE 218Z/covid19/Simone_Speizer/sj_socialdistancing_demdata_prepostdifs_manyvars.rds")

# combine the data so that plots can be animated with trendlines
# get the before shelter in place data
sj_dem_distancing_pre_shelter <- sj_dem_distancing_pre_post %>% dplyr::select(`% not completely at home pre shelter`, blockgroup) 
sj_dem_distancing_pre_shelter[is.na(sj_dem_distancing_pre_shelter)] <- 0

#  relabel column
colnames(sj_dem_distancing_pre_shelter)[1] <- "% leaving home"

# add back demographic variables
sj_dem_distancing_pre_shelter <- sj_dem_distancing_pre_shelter %>% left_join(sj_dem_distancing_pre_post, by = "blockgroup")

# get trendlines
sj_dem_distancing_pre_shelter <- sj_dem_distancing_pre_shelter %>%
  mutate(
    income_trendline = fitted(lm((sj_dem_distancing_pre_shelter)$`% leaving home` ~ (sj_dem_distancing_pre_shelter)$`% over 125,000`)),
    hispanic_trendline = fitted(lm((sj_dem_distancing_pre_shelter)$`% leaving home` ~ (sj_dem_distancing_pre_shelter)$`% non hispanic/latino`)),
    educ_trendline = fitted(lm((sj_dem_distancing_pre_shelter)$`% leaving home` ~ (sj_dem_distancing_pre_shelter)$`percent associates or higher`))) %>%
  cbind(`Before or After Shelter-in-Place` = "Before shelter-in-place")

# repeat for post shelter in place
sj_dem_distancing_post_shelter <- sj_dem_distancing_pre_post %>% dplyr::select(`% not completely at home`, blockgroup) 

sj_dem_distancing_post_shelter[is.na(sj_dem_distancing_post_shelter)] <- 0

#  relabel column
colnames(sj_dem_distancing_post_shelter)[1] <- "% leaving home"

# add back demographic variables
sj_dem_distancing_post_shelter <- sj_dem_distancing_post_shelter %>% left_join(sj_dem_distancing_pre_post, by = "blockgroup")

# get trendlines
sj_dem_distancing_post_shelter <- sj_dem_distancing_post_shelter %>%
  mutate(
    income_trendline = fitted(lm((sj_dem_distancing_post_shelter)$`% leaving home` ~ (sj_dem_distancing_post_shelter)$`% over 125,000`)),
    hispanic_trendline = fitted(lm((sj_dem_distancing_post_shelter)$`% leaving home` ~ (sj_dem_distancing_post_shelter)$`% non hispanic/latino`)),
    educ_trendline = fitted(lm((sj_dem_distancing_post_shelter)$`% leaving home` ~ (sj_dem_distancing_post_shelter)$`percent associates or higher`))) %>%
  cbind(`Before or After Shelter-in-Place` = "After shelter-in-place")

# combine them
sj_dem_distancing_pre_post_separate <- rbind(sj_dem_distancing_pre_shelter, sj_dem_distancing_post_shelter)

# convert the before/after column to factor so it shows up correctly on the plots
sj_dem_distancing_pre_post_separate$`Before or After Shelter-in-Place` <- factor(sj_dem_distancing_pre_post_separate$`Before or After Shelter-in-Place`, levels = c("Before shelter-in-place", "After shelter-in-place"))


fig_income <- 
  plot_ly (sj_dem_distancing_pre_post_separate) %>%
    add_trace(
      x = ~`% over 125,000`, 
      y = ~`% leaving home`, 
      frame = ~`Before or After Shelter-in-Place`, 
      type = 'scatter', 
      mode = 'markers', 
      showlegend = F
    ) %>% 
    add_trace(
      x = ~`% over 125,000`,
      y = ~income_trendline,
      type = 'scatter',
      mode = 'lines',
      line = list(width = 5, color = 'rgba(255, 165, 0, 1)'),
      frame = ~`Before or After Shelter-in-Place`,
      showlegend = F
    ) %>% 
  animation_button(visible = F) %>%
  animation_slider(
    pad = list(t =75),
    currentvalue = list(visible=F)
  ) %>% 
  layout(xaxis = list(title = 'Percent of households making over $125,000'), yaxis = list(title = 'Percent of devices leaving home'), margin = list(l = 75,r = 75))

fig_income

From this figure, we see that during the shelter-in-place order period a higher percentage of households making over $125,000 in a block group correlates with fewer devices leaving the home in that block group. This trend is the opposite of that observed prior to the shelter-in-place order, suggesting that block groups with a greater percentage of higher-income households were more able to adjust their behavior to comply with the shelter-in-place order.

To better assess this relative change in behavior, we fit a linear model to the change in devices staying completely at home after the shelter-in-place order (relative to before the order) and the percent of households earning more than $125,000. The results of that model, including the coefficient on income (the slope of the linear fit) and the R-squared value, are shown below.

Coefficient:

income_125_model_dif <- lm(`% increase in staying completely home` ~ `% over 125,000`, sj_dem_distancing_pre_post)
print(summary.lm(income_125_model_dif)$coefficients, digits  = 4, signif.stars=TRUE)
##                  Estimate Std. Error t value  Pr(>|t|)
## (Intercept)       13.2697    0.77542   17.11 3.084e-53
## `% over 125,000`   0.3097    0.01713   18.08 5.072e-58

R-squared:

print(summary.lm(income_125_model_dif)$r.squared, digits  = 4)
## [1] 0.3656

From the coefficient value, we see that for each one-percentage-point increase in the share of households with incomes over $125,000, the difference between the percent of devices staying completely at home after the shelter-in-place order and the percent completely at home before the order increases by about 0.3 percentage points. The R-squared value measures how much of the observed variation in the change in devices staying completely at home the model accounts for; the result of 0.37 indicates that the linear fit with income explains about 37% of that variation. The low p value indicates that the relationship is statistically significant. This is a relatively strong result for a single variable, even before examining the effect of other demographic variables.
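To make the interpretation of the coefficient concrete, the short calculation below plugs two hypothetical block groups into the fitted line, using the intercept and slope reported in the model output above. The two income shares (20% and 60%) are illustrative choices, not values from the data.

```r
# Coefficients reported above for the income-only model
intercept <- 13.2697
slope     <- 0.3097

# Hypothetical block groups: percent of households earning over $125,000
pct_over_125k <- c(20, 60)

# Predicted increase (in percentage points) in devices staying completely at home
predicted_increase <- intercept + slope * pct_over_125k
predicted_increase
# approximately 19.46 and 31.85
```

The 40-point gap in income share corresponds to a gap of about 12.4 percentage points in the predicted increase in devices staying home (40 times the slope).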

Education level is also a strong predictor on its own, but is highly correlated with income

Education level, specifically the percent of individuals in a block group with a degree at the Associate’s level or higher, is also a good predictor of the percent of devices leaving the home during the shelter-in-place order period.

fig_educ <- 
  plot_ly (sj_dem_distancing_pre_post_separate) %>%
    add_trace(
      x = ~`percent associates or higher`, 
      y = ~`% leaving home`, 
      frame = ~`Before or After Shelter-in-Place`, 
      type = 'scatter', 
      mode = 'markers', 
      showlegend = F
    ) %>% 
    add_trace(
      x = ~`percent associates or higher`,
      y = ~educ_trendline,
      type = 'scatter',
      mode = 'lines',
      line = list(width = 5, color = 'rgba(255, 165, 0, 1)'),
      frame = ~`Before or After Shelter-in-Place`,
      showlegend = F
    ) %>% 
  animation_button(visible = F) %>%
  animation_slider(
    pad = list(t =75),
    currentvalue = list(visible=F)
  ) %>% 
  layout(xaxis = list(title = 'Percent of individuals with an Associate Degree or higher'), yaxis = list(title = 'Percent of devices leaving home'), margin = list(l = 75,r = 75))

fig_educ

Similar to the correlation with income, during the shelter-in-place order period a higher percentage of individuals with degrees at the Associate’s level or higher in a block group correlates with fewer devices leaving the home in that block group. This is the opposite of the trend present prior to the shelter-in-place order.

The results of the linear model fitting the change in percent of devices staying completely at home and the percent of individuals with degrees at the Associate’s level or higher are shown below.

Coefficient:

educ_model_dif <- lm(`% increase in staying completely home` ~ `percent associates or higher`, sj_dem_distancing_pre_post)
print(summary.lm(educ_model_dif)$coefficients, digits  = 4, signif.stars=TRUE)
##                                Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                      12.880    0.90096   14.30 8.215e-40
## `percent associates or higher`    0.278    0.01768   15.72 1.689e-46

R-squared:

print(summary.lm(educ_model_dif)$r.squared, digits  = 4)
## [1] 0.3036

As the percent of individuals with degrees at the Associate’s level or higher increases by one percentage point, the change in percent of devices staying completely at home increases by about 0.28 percentage points. This linear model with education explains about 30% of the observed variation in the change in percent of devices staying completely at home. Again, the low p value indicates statistical significance.

As noted, this trend is very similar to that seen in the income data. We also expect income and education to be highly correlated. Thus, it is possible that education level may not provide much more information than income already does. To assess this, we performed a multiple regression analysis on these data. In a multiple regression analysis, combining multiple variables into a single model will either suggest that all the variables included have some explanatory power, or will indicate that once the effect of one or more of the variables is accounted for, some of the other variables lose their predictive ability. The results of the multiple regression analysis for income and education with change in percent of devices staying completely at home are shown below.

Coefficients:

educ_income_model_dif <- lm(`% increase in staying completely home` ~ `percent associates or higher` + `% over 125,000`, sj_dem_distancing_pre_post)
print(summary.lm(educ_income_model_dif)$coefficients, digits  = 4, signif.stars=TRUE)
##                                Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                     11.0467    0.86206  12.814 3.457e-33
## `percent associates or higher`   0.1251    0.02322   5.388 1.046e-07
## `% over 125,000`                 0.2201    0.02358   9.338 2.227e-19

R-squared:

print(summary.lm(educ_income_model_dif)$r.squared, digits  = 4)
## [1] 0.3966

The model with both education and income explains about 40% of the variation in the change in percent of devices staying completely at home. Note that this is an improvement of about 3 percentage points over the model with just income, which explained 37% of the variation; this modest gain is consistent with education and income being strongly correlated, while showing that education does provide some additional predictive ability beyond that provided by income. Note also that the coefficient on income is larger than the coefficient on education. Similarly, the p value for income is much lower than that for education, though both are still significant.
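Because R-squared never decreases when a predictor is added, a gain like the one above can also be checked with adjusted R-squared and a nested-model F-test. The sketch below shows one way to do this on simulated data; the variable names, effect sizes, and the data frame `sim` are invented, and this comparison is an illustration rather than a step the analysis is known to have performed.

```r
set.seed(42)
n <- 300

# Simulated stand-ins for block-group variables (not the real data)
income    <- runif(n, 0, 80)
education <- 0.6 * income + rnorm(n, sd = 10)   # correlated with income
change    <- 12 + 0.25 * income + 0.10 * education + rnorm(n, sd = 6)
sim <- data.frame(income, education, change)

income_only <- lm(change ~ income, data = sim)
income_educ <- lm(change ~ income + education, data = sim)

# Adjusted R-squared penalizes the extra term, unlike plain R-squared
c(income_only = summary(income_only)$adj.r.squared,
  income_educ = summary(income_educ)$adj.r.squared)

# F-test for whether adding education significantly improves the fit
anova(income_only, income_educ)
```

When only one predictor is added, the p value from this F-test matches the p value of that predictor's t-test in the larger model.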

Hispanic/Latino population appears to be a strong predictor on its own, but loses its predictive ability when combined with income, education, and Asian population

When we considered the Hispanic/Latino population of a block group and percent of devices leaving home, we observed that a greater percentage of non-Hispanic/Latino residents in a block group correlates with a lower percent of devices leaving home.

fig_hisp <- 
  plot_ly (sj_dem_distancing_pre_post_separate) %>%
    add_trace(
      x = ~`% non hispanic/latino`, 
      y = ~`% leaving home`, 
      frame = ~`Before or After Shelter-in-Place`, 
      type = 'scatter', 
      mode = 'markers', 
      showlegend = F
    ) %>% 
    add_trace(
      x = ~`% non hispanic/latino`,
      y = ~hispanic_trendline,
      type = 'scatter',
      mode = 'lines',
      line = list(width = 5, color = 'rgba(255, 165, 0, 1)'),
      frame = ~`Before or After Shelter-in-Place`,
      showlegend = F
    ) %>% 
  animation_button(visible = F) %>%
  animation_slider(
    pad = list(t =75),
    currentvalue = list(visible=F)
  ) %>% 
  layout(xaxis = list(title = 'Percent of residents that are not Hispanic or Latino'), yaxis = list(title = 'Percent of devices leaving home'), margin = list(l = 75,r = 75))

fig_hisp

The results of the linear model fitting the change in percent of devices staying completely at home and the percent of residents that are not Hispanic/Latino are shown below.

Coefficient:

hispanic_model_dif <- lm(`% increase in staying completely home` ~ `% non hispanic/latino`, sj_dem_distancing_pre_post)
print(summary.lm(hispanic_model_dif)$coefficients, digits  = 4, signif.stars=TRUE)
##                         Estimate Std. Error t value  Pr(>|t|)
## (Intercept)              10.4942    1.10795   9.472 7.369e-20
## `% non hispanic/latino`   0.2277    0.01546  14.730 8.085e-42

R-squared:

print(summary.lm(hispanic_model_dif)$r.squared, digits  = 4)
## [1] 0.2768

As the percent of individuals that are not Hispanic/Latino increases by one percentage point, the change in percent of devices staying completely at home increases by about 0.23 percentage points. This linear model with Hispanic/Latino population explains about 28% of the observed variation in the change in devices staying completely at home, comparable with the linear fit for education that was previously shown. The results are again statistically significant.

However, we hypothesized that the correlation observed here might be related to underlying correlations between Hispanic/Latino population and other demographic variables. To test this, we first performed a multiple regression analysis with both income and Hispanic/Latino population, yielding the following results.

Coefficients:

hispanic_inc_model_dif <- lm(`% increase in staying completely home` ~ `% non hispanic/latino` + `% over 125,000`, sj_dem_distancing_pre_post)
print(summary.lm(hispanic_inc_model_dif)$coefficients, digits  = 4, signif.stars=TRUE)
##                         Estimate Std. Error t value  Pr(>|t|)
## (Intercept)               9.5146    1.01569   9.368 1.746e-19
## `% non hispanic/latino`   0.1018    0.01839   5.535 4.761e-08
## `% over 125,000`          0.2325    0.02175  10.688 2.068e-24

R-squared:

print(summary.lm(hispanic_inc_model_dif)$r.squared, digits  = 4)
## [1] 0.3982

This result indicates that Hispanic/Latino population and income together explain about 40% of the variation in change in percent of devices staying completely at home. This is an improvement of about 3 percentage points over the model with income alone, indicating that after controlling for income, Hispanic/Latino population still provides some, though more limited, predictive power. Both variables have significant p values.

Hypothesizing that education and other race or ethnicity variables might be additional underlying factors, we next considered a linear model that again includes income and Hispanic/Latino population, but also incorporates education and percent of residents that are Asian; the results are shown below.

Coefficients:

hispanic_inc_educ_asian_model_dif <- lm(`% increase in staying completely home` ~ `% non hispanic/latino` + `% over 125,000` + `percent associates or higher` + `% Asian`, sj_dem_distancing_pre_post)
print(summary.lm(hispanic_inc_educ_asian_model_dif)$coefficients, digits  = 4, signif.stars=TRUE)
##                                Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                     9.35312    1.00046  9.3488 2.056e-19
## `% non hispanic/latino`         0.01427    0.02702  0.5282 5.976e-01
## `% over 125,000`                0.21492    0.02337  9.1974 7.049e-19
## `percent associates or higher`  0.09492    0.03153  3.0105 2.725e-03
## `% Asian`                       0.07275    0.01636  4.4473 1.048e-05

R-squared:

print(summary.lm(hispanic_inc_educ_asian_model_dif)$r.squared, digits  = 4)
## [1] 0.4237

When accounting for education, income, and Asian population, Hispanic/Latino population loses its predictive ability (its p value of about 0.60 is no longer significant), though all three of the other variables remain significant. Income appears to be the key variable, followed by education and Asian population. This combined model explains about 42% of the variation in the change in percent of devices staying completely at home.
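The pattern seen here, where a variable looks predictive on its own but fades once correlated variables are included, can be reproduced on simulated data. In the sketch below, `x2` has no effect of its own and is related to `y` only through its correlation with `x1`; all names and numbers are invented for illustration and do not come from the block-group data.

```r
set.seed(1)
n <- 500

# x2 is strongly correlated with x1 but has no direct effect on y
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
y  <- 2 * x1 + rnorm(n)

single   <- lm(y ~ x2)       # x2 alone picks up the effect of x1
combined <- lm(y ~ x1 + x2)  # x2 is now adjusted for x1

# Estimate and p value for x2, without and with x1 in the model
coef(summary(single))["x2", c("Estimate", "Pr(>|t|)")]
coef(summary(combined))["x2", c("Estimate", "Pr(>|t|)")]
```

In this setup we would expect the coefficient on `x2` to be large in the single-variable model and to shrink toward zero in the combined model.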

Income, education level, Asian population, child population, and young adult population together provide the greatest predictive ability

As we have shown, income, education level, and Asian population together are very strong predictors of the change in percent of devices staying completely at home. These three variables, when combined with child and young adult population, yielded a model with the greatest predictive ability for the change in percent of devices staying completely at home. Though the two age variables were not strong predictors on their own, when included in a multivariable model they did provide additional predictive power beyond that of a model with only income, education level, and Asian population. The parameters of the best-predicting model are presented below.

Coefficients:

inc_educ_asian_child_yad_model_dif <- lm(`% increase in staying completely home` ~ `% over 125,000` + `percent associates or higher` + `% Asian` + `percent less than 18` + `percent 20-29`, sj_dem_distancing_pre_post)
print(summary.lm(inc_educ_asian_child_yad_model_dif)$coefficients, digits  = 4, signif.stars=TRUE)
##                                Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                      6.6729    1.97741   3.375 7.902e-04
## `% over 125,000`                 0.1684    0.02326   7.242 1.464e-12
## `percent associates or higher`   0.1380    0.02280   6.051 2.622e-09
## `% Asian`                        0.0847    0.01432   5.913 5.834e-09
## `percent less than 18`           0.2290    0.05022   4.559 6.307e-06
## `percent 20-29`                 -0.1444    0.04323  -3.341 8.893e-04

R-squared:

print(summary.lm(inc_educ_asian_child_yad_model_dif)$r.squared, digits  = 4)
## [1] 0.4731

All five of these variables are significant in this model. Higher income, higher educational attainment, higher percent of residents that are Asian, and higher percent of residents that are children are all associated with larger increases in percent of devices staying completely at home; higher percent of residents that are ages 20-29 is associated with smaller increases in percent of devices staying completely at home following shelter-in-place. These five variables together explain about 47% of the variation in the change in percent of devices staying completely at home.

Note that income alone explained about 37% of the variation in the change in percent of devices staying completely at home, while adding education raised this to 40%. Including percent of residents that are Asian boosted the predictive power to 42%, and adding the percent of residents younger than 18 and the percent between the ages of 20 and 29 raised it to 47%.
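A progression like this can be tabulated by fitting a sequence of nested models and collecting each R-squared. The following is a minimal sketch on simulated data: the column names and the data frame `sim` are hypothetical, the effect sizes are only loosely modeled on the coefficients above, and predictors are added one at a time here, whereas the two age variables were added together in the text.

```r
set.seed(7)
n <- 400

# Simulated stand-in data; column names are hypothetical
sim <- data.frame(
  income    = runif(n, 0, 80),
  education = runif(n, 10, 90),
  asian     = runif(n, 0, 70),
  under_18  = runif(n, 5, 35),
  age_20_29 = runif(n, 5, 30)
)
sim$change <- 7 + 0.17 * sim$income + 0.14 * sim$education + 0.08 * sim$asian +
  0.23 * sim$under_18 - 0.14 * sim$age_20_29 + rnorm(n, sd = 6)

# Add predictors one step at a time and record R-squared for each nested model
predictors <- c("income", "education", "asian", "under_18", "age_20_29")
r_squared <- sapply(seq_along(predictors), function(k) {
  f <- reformulate(predictors[1:k], response = "change")
  summary(lm(f, data = sim))$r.squared
})
names(r_squared) <- sapply(seq_along(predictors), function(k) {
  paste(predictors[1:k], collapse = " + ")
})
round(r_squared, 3)
```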

Other results

Variables with some correlation, but that are less important in the overall model

Here we summarize the results for other demographic variables we considered that did have some correlation with changes in staying at home, but that were not found to be important in the highest-predicting multiple regression model. These variables include high speed internet access, occupants per room in a household, English language ability, and Spanish language ability.

The analysis on internet access, specifically the percent of households that have access to high speed internet, was inspired by the paper “Social Distancing, Internet Access and Inequality” by Chiou and Tucker (https://www.nber.org/papers/w26982), which found that the combination of high speed internet access and high income was the key driver of ability to stay at home. We did indeed find a correlation between increase in percent of devices staying completely at home and percent of households with broadband such as cable, fiber optic or DSL (coefficient 0.37, p value < 2e-16, R-squared 0.22). However, high speed internet access added almost no information to a model that already incorporated income: including it in a regression with income raised the R-squared value by only about 0.009 relative to the R-squared of 0.37 for the model with income alone, and it did not provide additional useful information in the multivariable regression model.

Similarly, though the percent of households that have 1 or fewer occupants per room was also correlated with change in devices staying at home (coefficient 0.37, p value < 2e-16, R-squared 0.17), this metric also failed to provide additional predictive power over income alone (R-squared 0.373 for income and occupants per room combined).

Percent of residents speaking English well provided some, but less, predictive power on its own (coefficient 0.37, p value < 2e-16, R-squared 0.12), but again was not significant in multiple regression analyses that incorporated other demographic variables.

Percent of residents speaking Spanish did offer some predictive power (percent of residents NOT speaking Spanish had a coefficient of 0.24, p value < 2e-16, R-squared 0.24) but is a very similar metric to the Hispanic/Latino population variable, and was similarly insignificant when combined with education and income.

Variables without strong correlation

Demographic variables we considered that lacked strong correlations with changes in staying at home include percent of residents ages 65 and older (R-squared 0.04), percent of residents that are white (R-squared 0.005), percent of households with a vehicle available (R-squared 0.08), and percent of workers that are male (R-squared 0.0003).
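Screening many candidate variables one at a time, as summarized in this section, can be done with a small loop over single-variable regressions. The sketch below uses simulated data in which only one of three candidates actually drives the outcome; the variable names, the data frame `sim`, and the effect sizes are all invented for illustration.

```r
set.seed(3)
n <- 300

# Simulated stand-in data; names and effect sizes are invented
sim <- data.frame(
  pct_65_plus = runif(n, 5, 30),
  pct_vehicle = runif(n, 70, 100),
  pct_income  = runif(n, 0, 80)
)
sim$change <- 13 + 0.3 * sim$pct_income + rnorm(n, sd = 8)

# Fit one single-variable model per candidate; collect slope, p value, R-squared
candidates <- c("pct_65_plus", "pct_vehicle", "pct_income")
screen <- do.call(rbind, lapply(candidates, function(v) {
  s <- summary(lm(reformulate(v, response = "change"), data = sim))
  data.frame(
    variable    = v,
    coefficient = s$coefficients[v, "Estimate"],
    p_value     = s$coefficients[v, "Pr(>|t|)"],
    r_squared   = s$r.squared
  )
}))

# Sort so the strongest single predictors appear first
screen[order(-screen$r_squared), ]
```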

Our full analyses can be viewed at https://stanfordfuturebay.github.io/simone_sd_correlations_analysis_sj_01.html.